    Finite Sample Analysis of Mean-Volatility Actor-Critic for Risk-Averse Reinforcement Learning

    The goal in the standard reinforcement learning problem is to find a policy that optimizes the expected return. However, such an objective is not adequate in many real-life applications, such as finance, where controlling the uncertainty of the outcome is imperative. The mean-volatility objective penalizes, through a tunable parameter, policies with a high variance of the per-step reward. An interesting property of this objective is that it admits simple linear Bellman equations that resemble, up to a reward transformation, those of the risk-neutral case. However, the required reward transformation is policy-dependent and requires the (usually unknown) expected return of the policy in use. In this work, we propose two general methods for policy evaluation under the mean-volatility objective: the direct method and the factored method. We then extend recent results on finite sample analysis in the risk-neutral actor-critic setting to the mean-volatility case. Our analysis shows that the sample complexity to attain an ϵ-accurate stationary point is the same as that of the risk-neutral version, with either policy evaluation method used to train the critic. Finally, we carry out experiments to test the proposed methods in a simple environment that exhibits a trade-off between optimality, in expectation, and uncertainty of the outcome.
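
    As a concrete illustration of the policy-dependent reward transformation mentioned in the abstract, the sketch below runs a tabular TD(0) critic on a transformed reward of the assumed form $r - \lambda (r - \hat{J})^2$, where $\hat{J}$ is a plug-in estimate of the expected return and $\lambda$ is the tunable risk parameter. This is a minimal sketch under that assumption, not the paper's direct or factored method; the function names and the toy data are invented for illustration.

```python
import numpy as np

def mean_volatility_reward(r, j_hat, lam):
    """Transformed per-step reward: penalize the squared deviation of the
    reward from j_hat, an estimate of the expected return (assumed form)."""
    return r - lam * (r - j_hat) ** 2

def td0_critic(transitions, j_hat, lam, n_states, gamma=0.99, alpha=0.05):
    """TD(0) policy evaluation on the transformed reward.

    transitions: list of (state, reward, next_state) tuples collected under
    the policy being evaluated.  Setting lam = 0 recovers a risk-neutral critic.
    """
    v = np.zeros(n_states)
    for s, r, s_next in transitions:
        r_tilde = mean_volatility_reward(r, j_hat, lam)
        td_error = r_tilde + gamma * v[s_next] - v[s]
        v[s] += alpha * td_error
    return v

# Toy usage: two states, a handful of transitions collected under some policy.
transitions = [(0, 1.0, 1), (1, 0.2, 0), (0, 1.0, 1), (1, 0.2, 0)]
j_hat = np.mean([r for _, r, _ in transitions])  # simple plug-in estimate of the expected return
v = td0_critic(transitions, j_hat, lam=0.5, n_states=2)
```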

    On the Minimax Regret for Online Learning with Feedback Graphs

    In this work, we improve on the upper and lower bounds for the regret of online learning with strongly observable undirected feedback graphs. The best known upper bound for this problem is $\mathcal{O}\bigl(\sqrt{\alpha T\ln K}\bigr)$, where $K$ is the number of actions, $\alpha$ is the independence number of the graph, and $T$ is the time horizon. The $\sqrt{\ln K}$ factor is known to be necessary when $\alpha = 1$ (the experts case). On the other hand, when $\alpha = K$ (the bandits case), the minimax rate is known to be $\Theta\bigl(\sqrt{KT}\bigr)$, and a lower bound $\Omega\bigl(\sqrt{\alpha T}\bigr)$ is known to hold for any $\alpha$. Our improved upper bound $\mathcal{O}\bigl(\sqrt{\alpha T(1+\ln(K/\alpha))}\bigr)$ holds for any $\alpha$ and matches the lower bounds for bandits and experts, while interpolating intermediate cases. To prove this result, we use FTRL with $q$-Tsallis entropy for a carefully chosen value of $q \in [1/2, 1)$ that varies with $\alpha$. The analysis of this algorithm requires a new bound on the variance term in the regret. We also show how to extend our techniques to time-varying graphs, without requiring prior knowledge of their independence numbers. Our upper bound is complemented by an improved $\Omega\bigl(\sqrt{\alpha T(\ln K)/(\ln\alpha)}\bigr)$ lower bound for all $\alpha > 1$, whose analysis relies on a novel reduction to multitask learning. This shows that a logarithmic factor is necessary as soon as $\alpha < K$.
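
    To make the algorithmic ingredient concrete, the sketch below implements generic FTRL with a $q$-Tsallis-entropy regularizer and importance-weighted loss estimates over an undirected feedback graph with self-loops. It is a minimal sketch of that general machinery, not the paper's algorithm: the fixed values of `q` and `eta`, the bisection-based normalizer, and all helper names are placeholder choices, and the paper's $\alpha$-dependent tuning of $q$ is not reproduced.

```python
import numpy as np

def tsallis_ftrl_distribution(loss_est, eta, q):
    """FTRL step with q-Tsallis regularization, q in [1/2, 1).

    The minimizer has the form p_i = (eta * (L_i + mu)) ** (-1 / (1 - q)),
    where the normalizer mu is found by bisection so that sum_i p_i = 1.
    """
    def probs(mu):
        return (eta * (loss_est + mu)) ** (-1.0 / (1.0 - q))

    lo = -float(loss_est.min()) + 1e-12  # smallest mu keeping every argument positive
    span = 1.0
    while probs(lo + span).sum() > 1.0:  # grow the bracket until the sum drops below 1
        span *= 2.0
    hi = lo + span
    for _ in range(60):                  # bisect on the normalizer mu
        mid = 0.5 * (lo + hi)
        if probs(mid).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    p = probs(hi)
    return p / p.sum()

def play_round(p, losses, neighbors, rng):
    """One round with undirected feedback-graph observations.

    neighbors[i] is the set of arms whose loss is revealed when arm i is
    played (self-loops assumed).  Returns the played arm and the
    importance-weighted loss estimates.
    """
    k = len(p)
    arm = rng.choice(k, p=p)
    est = np.zeros(k)
    for i in range(k):
        if i in neighbors[arm]:
            # Probability that arm i is observed: playing any j with i in N(j).
            obs_prob = sum(p[j] for j in range(k) if i in neighbors[j])
            est[i] = losses[i] / obs_prob
    return arm, est

def feedback_graph_ftrl(losses_seq, neighbors, q=0.5, eta=0.1, seed=0):
    """Run FTRL with q-Tsallis entropy over a sequence of loss vectors.
    The paper tunes q in [1/2, 1) as a function of the independence number;
    here q and eta are fixed placeholders."""
    rng = np.random.default_rng(seed)
    k = len(neighbors)
    cum_est = np.zeros(k)
    total_loss = 0.0
    for losses in losses_seq:
        p = tsallis_ftrl_distribution(cum_est, eta, q)
        arm, est = play_round(p, losses, neighbors, rng)
        total_loss += losses[arm]
        cum_est += est
    return total_loss

# Usage sketch: 3 arms on a path graph with self-loops, random losses.
# neighbors = [{0, 1}, {0, 1, 2}, {1, 2}]
# losses_seq = np.random.default_rng(1).random((500, 3))
# print(feedback_graph_ftrl(losses_seq, neighbors, q=0.5, eta=0.1))
```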

    Information-Theoretic Regret Bounds for Bandits with Fixed Expert Advice

    We investigate the problem of bandits with expert advice when the experts are fixed and known distributions over the actions. Improving on previous analyses, we show that the regret in this setting is controlled by information-theoretic quantities that measure the similarity between experts. In some natural special cases, this allows us to obtain the first regret bound for EXP4 that can get arbitrarily close to zero when the experts are similar enough. For a different algorithm, we provide another bound that describes the similarity between the experts in terms of the KL divergence, and we show that this bound can be smaller than that of EXP4 in some cases. Additionally, we provide lower bounds for certain classes of experts, showing that the algorithms we analyzed are nearly optimal in some cases.
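
    The EXP4 bound discussed above concerns the standard EXP4 algorithm run with fixed, known expert distributions. The sketch below is a minimal implementation of that vanilla scheme with the usual importance weighting; the learning rate `eta` and the data layout are placeholder choices, and the information-theoretic tuning analyzed in the paper is not reproduced.

```python
import numpy as np

def exp4_fixed_experts(expert_dists, losses_seq, eta=0.1, seed=0):
    """EXP4 with fixed and known expert distributions over actions.

    expert_dists: (n_experts, n_actions) array; each row is an expert's
    fixed distribution over the actions.
    losses_seq: iterable of loss vectors in [0, 1] of length n_actions.
    """
    rng = np.random.default_rng(seed)
    n_experts, n_actions = expert_dists.shape
    q = np.full(n_experts, 1.0 / n_experts)    # exponential weights over experts
    total_loss = 0.0
    for losses in losses_seq:
        p = q @ expert_dists                   # mixture distribution over actions
        p = p / p.sum()                        # guard against floating-point drift
        a = rng.choice(n_actions, p=p)
        total_loss += losses[a]
        loss_est = np.zeros(n_actions)
        loss_est[a] = losses[a] / p[a]         # importance-weighted loss estimate
        expert_loss = expert_dists @ loss_est  # propagate the estimate to each expert
        q *= np.exp(-eta * expert_loss)
        q /= q.sum()
    return total_loss

# Usage sketch: two similar experts over three actions, random losses.
# experts = np.array([[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]])
# losses_seq = np.random.default_rng(1).random((1000, 3))
# print(exp4_fixed_experts(experts, losses_seq))
```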